Sentiments and Topics in South African SONA Speeches

~ STA5073Z Data Science for Industry Assignment 2

Authors

Jared Tavares (TVRJAR001)

Heiletjé van Zyl (VZYHEI003)

Abstract

Introduction

The field of Natural Language Processing (NLP) encompasses techniques tailored to theme tracking and opinion mining, both of which form part of text analysis. Of particular prominence are the extraction of latent thematic patterns and the measurement of the degree of emotionality expressed in political texts.

Given such political context, it is of specific interest to analyse the annual State of the Nation Address (SONA) speeches delivered by six different South African presidents (F.W. de Klerk, N.R. Mandela, T.M. Mbeki, K.P. Motlanthe, J.G. Zuma, and M.C. Ramaphosa) over twenty-nine years (1994 to 2023). This analysis, descriptive and data-driven in nature, examines the content of the SONA speeches in terms of themes via topic modelling (TM) and emotions via sentiment analysis (SentA). As illustrated in Figure 1, the exploration is twice bifurcated, executing the aforementioned techniques at a macro and micro scale at both the text level (all-presidents versus by-president SONA speeches, respectively) and the token level (sentences versus words, respectively).

Figure 1: Illustration of how the NLP techniques, sentiment analysis and topic modelling, will be implemented within a different-scales-within-different-levels framework for SONA-speech text analysis.

Through such a multi-layered lens, any trends in topics and sentiments over time can be identified at both a large scale (the presidents as a collective) and a small scale (each president as an individual). This provides not only an aggregated perspective of the general political discourse prevailing within South Africa, but also a more focused view of the specific rhetoric employed by each of the country’s serving presidents during different periods.

To achieve the above, foundational terms are first revised and related literature is reviewed in the context of politics and NLP. All pertinent pre-processing of the political text data is then considered, followed by a discussion of the details of each SentA and TM approach applied in the analysis. Specifically, three different lexicons are leveraged to describe sentiments, whilst five different topic models are applied to uncover themes within the South African presidents’ SONA speeches. Following the implementation of these methodologies, the results are detailed in terms of insights and interpretations. Thereafter, an overall evaluation of the techniques in terms of efficacy and inadequacy is given. Finally, focal findings are highlighted and potential improvements for future research are recommended.

Methods

Topic modelling

Latent Semantic Analysis (LSA)

Schematic representation of LSA.

LSA (Deerwester et al. 1990) is a non-probabilistic, non-generative model in which a form of matrix factorization is utilized to uncover a few latent topics, capturing meaningful relationships among documents and tokens. As depicted in Figure, in the first step a document-term matrix (DTM) is generated from the raw text data by tokenizing d documents into w words (or sentences), forming the columns and rows respectively. Each entry is valued via either the bag-of-words (BoW) or tf-idf approach. This DTM, which is often sparse and high-dimensional, is then decomposed via a dimensionality-reduction technique, namely truncated Singular Value Decomposition (SVD). Consequently, in the second step the DTM becomes the product of three matrices: the topic-word matrix At* (for the tokens), the topic-prevalence matrix Bt* (for the latent semantic factors), and the transposed document-topic matrix CTt* (for the documents). Here t*, the optimal number of topics, is a hyperparameter refined to a value (either via the Silhouette-coefficient or the coherence-measure approach) that retains the most significant dimensions in the transformed space. In the final step, the text data is encoded using this top-topic number.

Given that LSA involves only a DTM, its implementation is generally efficient. However, the truncated SVD step introduces some computational intensity and prevents quick updates with new, incoming text data. Additional LSA drawbacks include a lack of interpretability, the underlying linear-model framework (which results in poor performance on text data with non-linear dependencies), and the underlying Gaussian assumption for tokens in documents (which may not be an appropriate distribution).

Probabilistic Latent Semantic Analysis (pLSA)

Instead of implementing truncated SVD, pLSA (Hofmann 1999) utilizes a generative, probabilistic model. Within this framework, a document d is first selected with probability P(d). Given this document, a latent topic t is chosen with probability P(t|d). Finally, given the chosen topic t, a word w (or sentence) is generated with probability P(w|t), as shown in Figure. The values of P(d) are determined directly from the corpus D, which is defined in terms of a DTM. In contrast, the probabilities P(t|d) and P(w|t) are parameters modelled as multinomial distributions and iteratively updated via the Expectation-Maximization (EM) algorithm. Direct parallelism between LSA and pLSA can be drawn via the methods’ parameterization, as conveyed via matching colours of the topic-word matrix and P(w|t), the document-topic matrix and P(d|t), as well as the topic-prevalence matrix and P(t), displayed in Figure and Figure, respectively.

Despite pLSA implicitly addressing LSA-related disadvantages, the method still has two main drawbacks. There is no probability model for the document-topic probabilities P(t|d), so topic mixtures cannot be assigned to new, unseen documents not trained on. The number of model parameters also grows linearly with the number of documents, making the method susceptible to overfitting.
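The EM updates described above can be made concrete with a small NumPy sketch on a toy count matrix (not the report's actual corpus): the E-step computes the topic posterior P(t|d,w) ∝ P(t|d)·P(w|t), and the M-step re-estimates both multinomials from the expected counts.

```python
# pLSA EM sketch on a toy document-term count matrix (illustration only).
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(6, 8)).astype(float)  # 6 docs x 8 words
D, W = X.shape
T = 2                                              # number of latent topics

# Random initial parameters, normalized into probability distributions
p_t_d = rng.random((D, T)); p_t_d /= p_t_d.sum(1, keepdims=True)  # P(t|d)
p_w_t = rng.random((T, W)); p_w_t /= p_w_t.sum(1, keepdims=True)  # P(w|t)

for _ in range(50):
    # E-step: posterior P(t|d,w) proportional to P(t|d) * P(w|t)
    post = p_t_d[:, :, None] * p_w_t[None, :, :]   # shape D x T x W
    post /= post.sum(1, keepdims=True) + 1e-12
    # M-step: re-estimate P(t|d) and P(w|t) from expected counts
    n_dt = (X[:, None, :] * post).sum(2)           # expected doc-topic counts
    p_t_d = n_dt / (n_dt.sum(1, keepdims=True) + 1e-12)
    n_tw = (X[:, None, :] * post).sum(0)           # expected topic-word counts
    p_w_t = n_tw / (n_tw.sum(1, keepdims=True) + 1e-12)
```

Note that the learned P(t|d) table is tied to the training documents, which is precisely the folding-in limitation mentioned above.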

Latent Dirichlet Allocation (LDA)

Schematic representation of LDA.

LDA is another generative, probabilistic model that can be regarded as a hierarchical Bayesian version of pLSA. By explicitly defining a generative model for the document-topic probabilities, both of the above-mentioned pitfalls of pLSA are addressed: the number of parameters to estimate decreases drastically, and the model can be applied and generalized to new, unseen documents. As presented in Figure, the initial steps involve randomly sampling a document-topic probability distribution (\(\theta\)) from a Dirichlet (Dir) distribution (\(\eta\)), followed by randomly sampling a topic-word probability distribution (\(\phi\)) from another Dirichlet distribution (\(\tau\)). From the \(\theta\) distribution, a topic t is selected by drawing from a multinomial (Mult) distribution (third step), and from the \(\phi\) distribution given said topic t, a word w (or sentence) is sampled from another multinomial distribution (fourth step). The associated LDA parameters are then estimated via a variational expectation-maximization algorithm or collapsed Gibbs sampling.

Correlated Topic Model (CTM)

Closely following LDA, the CTM (Lafferty and Blei 2005) additionally allows the presence of correlated topics to be modelled. Such topic correlations are introduced via the multivariate normal (MultNorm) distribution with a length-t vector of means (\(\mu\)) and a t \(\times\) t covariance matrix (\(\Sigma\)), where the resulting values are mapped into probabilities via a logistic transformation. Comparing Figure and Figure, the nuance between LDA and CTM is highlighted in light blue: the discrepancy between the models arises from replacing the Dirichlet distribution (which involves an implicit assumption of independence across topics) with the logistic-normal distribution (which explicitly enables topic dependency via a covariance structure) for generating document-topic probabilities. The other generative processes previously outlined for LDA are retained and repeated for CTM. Given this additional model complexity, the more convoluted mean-field variational inference algorithm is employed for CTM parameter estimation, which necessitates many iterations for optimization. CTM is consequently computationally more expensive than LDA, though this snag is far outweighed by the procurement of richer topics with overt relationships acknowledged between them.
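The logistic-normal substitution can be written explicitly: instead of drawing the document-topic proportions from a Dirichlet, CTM draws a real-valued vector \(\eta_d\) from a multivariate normal and maps it onto the probability simplex with the softmax (logistic) transformation,

```latex
\eta_d \sim \mathcal{N}(\mu, \Sigma), \qquad
\theta_{d,k} = \frac{\exp(\eta_{d,k})}{\sum_{k'=1}^{t} \exp(\eta_{d,k'})},
\quad k = 1, \dots, t.
```

Non-zero off-diagonal entries of \(\Sigma\) are what allow two topics to co-occur more (or less) often than independence would predict.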

Read in the data

Exploratory Data Analysis

Figure: bar chart of the top 10 words across all SONA speeches (saved to saved_plots/top_10_words_across_speeches_chart.pdf).

Sentiment analysis

Topic modelling

LSA

(0, '0.267*"year" + 0.242*"government" + 0.198*"work" + 0.195*"south" + 0.188*"people" + 0.163*"country" + 0.145*"development" + 0.142*"national" + 0.140*"programme" + 0.134*"african"')
(1, '-0.169*"government" + 0.146*"south" + -0.142*"regard" + 0.135*"year" + -0.134*"people" + 0.115*"energy" + 0.114*"000" + -0.113*"shall" + -0.112*"ensure" + -0.102*"question"')
(2, '0.140*"honourable" + 0.131*"programme" + -0.125*"pandemic" + 0.123*"continue" + -0.115*"new" + 0.110*"development" + 0.109*"rand" + -0.107*"great" + 0.106*"compatriot" + -0.102*"investment"')
(3, '0.337*"alliance" + 0.240*"transitional" + 0.204*"party" + 0.204*"constitution" + 0.156*"zulu" + 0.155*"constitutional" + 0.131*"south" + 0.126*"concern" + 0.125*"election" + 0.122*"freedom"')
(4, '-0.219*"shall" + 0.204*"people" + -0.148*"year" + 0.144*"alliance" + -0.130*"start" + 0.101*"government" + 0.097*"address" + 0.093*"transitional" + -0.088*"community" + -0.088*"citizen"')

pLSA (Probabilistic Latent Semantic Analysis)

[(0,
  '0.001*"year" + 0.001*"government" + 0.001*"work" + 0.001*"south" + 0.001*"people" + 0.001*"country" + 0.001*"development" + 0.001*"national" + 0.001*"programme" + 0.001*"continue"'),
 (1,
  '0.001*"year" + 0.000*"south" + 0.000*"government" + 0.000*"work" + 0.000*"country" + 0.000*"african" + 0.000*"people" + 0.000*"africa" + 0.000*"development" + 0.000*"programme"'),
 (2,
  '0.001*"government" + 0.001*"year" + 0.000*"people" + 0.000*"south" + 0.000*"work" + 0.000*"country" + 0.000*"ensure" + 0.000*"african" + 0.000*"programme" + 0.000*"service"'),
 (3,
  '0.001*"government" + 0.001*"year" + 0.001*"people" + 0.001*"south" + 0.001*"work" + 0.000*"african" + 0.000*"country" + 0.000*"national" + 0.000*"africa" + 0.000*"development"'),
 (4,
  '0.000*"year" + 0.000*"work" + 0.000*"government" + 0.000*"people" + 0.000*"development" + 0.000*"national" + 0.000*"country" + 0.000*"south" + 0.000*"african" + 0.000*"programme"')]

LDA (Latent Dirichlet Allocation)

[(0,
  '0.001*"year" + 0.001*"south" + 0.000*"government" + 0.000*"work" + 0.000*"people" + 0.000*"development" + 0.000*"african" + 0.000*"country" + 0.000*"national" + 0.000*"africa"'),
 (1,
  '0.001*"year" + 0.001*"people" + 0.001*"government" + 0.000*"country" + 0.000*"national" + 0.000*"development" + 0.000*"work" + 0.000*"african" + 0.000*"ensure" + 0.000*"south"'),
 (2,
  '0.001*"government" + 0.001*"year" + 0.001*"south" + 0.001*"people" + 0.001*"work" + 0.001*"country" + 0.001*"programme" + 0.001*"african" + 0.001*"development" + 0.001*"national"'),
 (3,
  '0.001*"government" + 0.001*"work" + 0.001*"year" + 0.001*"south" + 0.001*"people" + 0.001*"country" + 0.001*"development" + 0.001*"ensure" + 0.001*"programme" + 0.001*"national"'),
 (4,
  '0.000*"year" + 0.000*"south" + 0.000*"work" + 0.000*"people" + 0.000*"government" + 0.000*"national" + 0.000*"programme" + 0.000*"country" + 0.000*"africa" + 0.000*"african"')]

CTM (Correlated Topic Model)

CTM training log (100 variational EM iterations): the log-likelihood improved from -6.781 at iteration 0 to approximately -5.891 by iteration 99, with only minor fluctuations over the final iterations, indicating convergence.
Topic #0: [('year', 0.044516902416944504), ('public', 0.04345909506082535), ('sector', 0.042665742337703705), ('include', 0.03490849584341049), ('address', 0.026622343808412552), ('plan', 0.023713376373052597), ('implementation', 0.01754283718764782), ('focus', 0.016661332920193672), ('parliament', 0.015339074656367302), ('democratic', 0.013840515166521072)]
Topic #1: [('make', 0.04923662543296814), ('support', 0.02858937717974186), ('life', 0.02700112760066986), ('continue', 0.023648155853152275), ('achieve', 0.016853975132107735), ('billion', 0.016501031816005707), ('like', 0.015001017600297928), ('building', 0.014824545942246914), ('act', 0.014559837989509106), ('freedom', 0.014206892810761929)]
Topic #2: [('people', 0.06193041428923607), ('national', 0.05984874814748764), ('regard', 0.027149254456162453), ('time', 0.026195157319307327), ('water', 0.01864911988377571), ('past', 0.017781758680939674), ('start', 0.01596030220389366), ('human', 0.014572524465620518), ('youth', 0.014225580729544163), ('implement', 0.013791900128126144)]
Topic #3: [('development', 0.06043345108628273), ('need', 0.03524592146277428), ('growth', 0.030785631388425827), ('area', 0.02719990722835064), ('progress', 0.01967862993478775), ('infrastructure', 0.01889152079820633), ('president', 0.018454236909747124), ('resource', 0.016442732885479927), ('set', 0.016092905774712563), ('poverty', 0.015655621886253357)]
Topic #4: [('year', 0.05468795448541641), ('ensure', 0.04753589630126953), ('u', 0.04195380210876465), ('state', 0.032185137271881104), ('society', 0.02730080485343933), ('far', 0.02669026330113411), ('world', 0.026166941970586777), ('issue', 0.01814267970621586), ('great', 0.01727047748863697), ('woman', 0.01718325726687908)]
Topic #5: [('government', 0.10098280012607574), ('service', 0.043241847306489944), ('million', 0.02459902875125408), ('opportunity', 0.022700222209095955), ('improve', 0.017694279551506042), ('capacity', 0.01674487628042698), ('measure', 0.014155595563352108), ('child', 0.011566314846277237), ('small', 0.010875841602683067), ('especially', 0.010703222826123238)]
Topic #6: [('african', 0.05693892762064934), ('programme', 0.05599425360560417), ('africa', 0.04878038167953491), ('business', 0.030316302552819252), ('nation', 0.029972784221172333), ('community', 0.026022329926490784), ('create', 0.022758912295103073), ('crime', 0.02224363386631012), ('land', 0.01829317957162857), ('important', 0.017434384673833847)]
Topic #7: [('work', 0.08179971575737), ('country', 0.07003837078809738), ('new', 0.04483548551797867), ('honourable', 0.02140122652053833), ('security', 0.021224362775683403), ('000', 0.01777554862201214), ('level', 0.017333392053842545), ('build', 0.016449080780148506), ('department', 0.014238301664590836), ('policy', 0.013442421332001686)]
Topic #8: [('economy', 0.03514421358704567), ('investment', 0.02742135524749756), ('process', 0.025425558909773827), ('provide', 0.02473136968910694), ('job', 0.024557823315262794), ('increase', 0.019004305824637413), ('high', 0.01813656836748123), ('international', 0.012930147349834442), ('private', 0.012149184010922909), ('action', 0.011281446553766727)]
Topic #9: [('south', 0.08503616601228714), ('economic', 0.03789225220680237), ('social', 0.0345437116920948), ('project', 0.022118866443634033), ('continue', 0.01832973025739193), ('improve', 0.018241610378026962), ('people', 0.017096057534217834), ('effort', 0.016391100361943245), ('local', 0.01621486246585846), ('year', 0.015598026104271412)]

ATM (Author-Topic Model)

[(0,
  '0.011*"year" + 0.009*"work" + 0.009*"south" + 0.008*"government" + 0.008*"country" + 0.006*"national" + 0.005*"people" + 0.005*"development" + 0.005*"african" + 0.005*"new"'),
 (1,
  '0.009*"south" + 0.008*"year" + 0.007*"national" + 0.007*"programme" + 0.006*"development" + 0.006*"government" + 0.006*"work" + 0.006*"country" + 0.005*"people" + 0.005*"africa"'),
 (2,
  '0.009*"year" + 0.008*"work" + 0.007*"south" + 0.006*"government" + 0.006*"country" + 0.006*"people" + 0.004*"programme" + 0.004*"service" + 0.004*"make" + 0.004*"national"'),
 (3,
  '0.010*"government" + 0.009*"year" + 0.008*"people" + 0.007*"work" + 0.007*"south" + 0.006*"ensure" + 0.006*"programme" + 0.006*"make" + 0.005*"african" + 0.005*"development"'),
 (4,
  '0.013*"year" + 0.011*"government" + 0.009*"people" + 0.008*"south" + 0.008*"work" + 0.007*"country" + 0.007*"national" + 0.006*"development" + 0.006*"african" + 0.006*"programme"')]

References

Deerwester, Scott, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. “Indexing by Latent Semantic Analysis.” Journal of the American Society for Information Science 41 (6): 391–407. https://doi.org/10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9.
Hofmann, Thomas. 1999. “Probabilistic Latent Semantic Indexing.” In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 50–57. SIGIR ’99. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/312624.312649.
Lafferty, John, and David Blei. 2005. “Correlated Topic Models.” In Advances in Neural Information Processing Systems, edited by Y. Weiss, B. Schölkopf, and J. Platt. Vol. 18. MIT Press. https://proceedings.neurips.cc/paper_files/paper/2005/file/9e82757e9a1c12cb710ad680db11f6f1-Paper.pdf.